This project is based on a dataset from Glass Door. It is built for various purposes:
(1) Exploration about the unigram term frequency, tf-idf, and bigram presence of the text in pro reviews, con reviews, and advice.
(2) What rating type has an outstanding pattern other than the rest and is worthy of further inspection?
(3) How does the sentiment scores shown in pro reviews, con reviews, and advice affect the rating of interest?
(4) Topic modelings for pro reviews, con reviews, and advice.
(5) For the topics modeled, how does rating impact the mapping of such topics?
The following packages are called in for this project.
library(dplyr)
library(tidyverse)
library(stringr)
library(tidytext)
library(tm)
library(rmarkdown)
library(wordcloud2)
library(lexicon)
library(textdata)
library(gganimate)
library(httr)
library(textstem)
library(widyr)
library(stopwords)
library(tibble)
library(NMF)
library(fmsb)
library(stm)
library(ggpubr)
library(lme4)
Load in the data.
load("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/glassDoor.rdata")
rmarkdown::paged_table(glassDoor)
There is no NA’s in the overall rating. This might be the rating of interest for further inspection given this perk. As the numbers of NA entries vary for other types of ratings that include work life rating, culture value rating, career opportunity rating, company benefit rating, and management rating.
summary(glassDoor)
pros cons advice rating
Length:1831 Length:1831 Length:1831 4.0:461
Class :character Class :character Class :character 3.0:370
Mode :character Mode :character Mode :character 5.0:551
2.0:170
1.0:279
workLifeRating cultureValueRating careerOpportunityRating
4.0 :431 5.0 :526 5.0 :393
5.0 :407 4.0 :340 4.0 :388
3.0 :363 1.0 :310 3.0 :342
1.0 :231 3.0 :302 1.0 :284
2.0 :195 2.0 :151 2.0 :202
(Other): 59 (Other): 48 (Other): 64
NA's :145 NA's :154 NA's :158
compBenefitsRating managementRating organization
4.0 :434 2.0 :160 Length:1831
3.0 :406 3.0 :275 Class :character
5.0 :327 4.0 :313 Mode :character
2.0 :237 1.0 :342
1.0 :229 5.0 :398
(Other): 33 NA's:343
NA's :165
Check if there is any missing entries in columns pros, cons, and advice. There are 1831 rows of entries, while there is none missing in column pros and cons; there are 655 NA’s in column advice, which is a 35.77% of the total rows of data - this is maybe something that we need to come back to later.
sum(is.na(glassDoor$pros))
[1] 0
sum(is.na(glassDoor$cons))
[1] 0
sum(is.na(glassDoor$advice))
[1] 655
An extra step taken to make sure the data types are in shape for further inspection.
glassDoor <- glassDoor %>%
mutate(organization = as.factor(organization)) %>%
mutate_at(c("rating", "workLifeRating", "cultureValueRating", "careerOpportunityRating", "compBenefitsRating", "managementRating"), as.numeric)
There are several issues with the text in pros, cons and advice: i.e. some words run together, non-english text and maybe more. These issues need to be addressed since collectively, they might affect the result of text analytics to a certain extent. Plus, stop words are removed for preparation purposes.
#library(stringi)
glassDoor$pros <- str_replace_all(glassDoor$pros, "[^[:alnum:]]", " ") %>%
str_squish(.) %>%
gsub("([a-z])([A-Z])", "\\1 \\2", .) %>%
tm::removeWords(., words = stopwords("en")) %>%
lemmatize_strings(.) %>%
tm::removeNumbers(.)
glassDoor$cons <- str_replace_all(glassDoor$cons, "[^[:alnum:]]", " ") %>%
str_squish(.) %>%
gsub("([a-z])([A-Z])", "\\1 \\2", .) %>%
tm::removeWords(., words = stopwords("en")) %>%
lemmatize_strings(.) %>%
tm::removeNumbers(.)
glassDoor$advice <- str_replace_all(glassDoor$advice, "[^[:alnum:]]", " ") %>%
str_squish(.) %>%
gsub("([a-z])([A-Z])", "\\1 \\2", .) %>%
tm::removeWords(., words = stopwords("en")) %>%
lemmatize_strings(.) %>%
tm::removeNumbers(.)
#glassDoor$pros <- stringi::stri_trans_general(glassDoor$pros, "latin-ascii")
#glassDoor$cons <- stringi::stri_trans_general(glassDoor$cons, "latin-ascii")
#glassDoor$advice <- stringi::stri_trans_general(glassDoor$advice, "latin-ascii")
#glassDoor$pros <- iconv(glassDoor$pros, to = "ASCII")
# remove the non-english text in pros, cons, and advice
germanText = grepl("den|der|die|ich", glassDoor$pros)
germanText = grepl("den|der|die|ich", glassDoor$cons)
germanText = grepl("den|der|die|ich", glassDoor$advice)
germanRows = which(germanText)
glassDoor = glassDoor[-c(germanRows), ]
Just for the convenience in future endeavor.
glassDoor$id <- seq.int(nrow(glassDoor))
To start with, who doesn’t love clouds?
glassDoor %>%
dplyr::select(pros) %>%
unnest_tokens(word, pros) %>%
count(word, sort = TRUE) %>%
anti_join(stop_words, by = "word") %>%
filter(n>50) %>%
na.omit() %>%
wordcloud2::wordcloud2(color = "darkseagreen", shape = "circle")
glassDoor %>%
dplyr::select(cons) %>%
unnest_tokens(word, cons) %>%
count(word, sort = TRUE) %>%
anti_join(stop_words, by = "word") %>%
filter(n>50) %>%
na.omit() %>%
wordcloud2::wordcloud2(color = "darksalmon", shape = "circle")
glassDoor %>%
dplyr::select(advice) %>%
# filter(!is.na(advice)) %>%
unnest_tokens(word, advice) %>%
count(word, sort = TRUE) %>%
anti_join(stop_words, by = "word") %>%
filter(n>15) %>%
na.omit() %>%
wordcloud2::wordcloud2(color = "cadetblue3", shape = "circle")
The table shows the top 15 words with highest term frequency in the pro reviews for each organization. Words like “company”, “people”, “managment” remain high on the list almost for each organization.
pro <- glassDoor %>%
select(pros, organization) %>%
mutate(pros = tolower(pros),
pros = lemmatize_strings(pros))
pro2 <- pro %>%
unnest_tokens(word, pros) %>%
count(organization, word, sort = TRUE) %>%
anti_join(stop_words, by = "word")
proTF <- pro %>%
split(., .$organization) %>%
lapply(., function(x) {
prosTokens = tm::MC_tokenizer(x$pros)
total = length(prosTokens)
})
temppro <- unlist(proTF)
proTF <- data.frame(organization = names(temppro),
total = temppro)
rownames(proTF) <- NULL
proTF <- pro2 %>%
left_join(., proTF, by = "organization")
proTF$tfreq <- proTF$n/proTF$total
proTFtop <- proTF %>%
group_by(organization) %>%
arrange(desc(tfreq)) %>%
slice(1:15)
rmarkdown::paged_table(proTFtop)
The table shows the top 15 words with highest term frequency in the con reviews for each organization. Words like “company”, “employee”, “managment” remain high on the list almost for each organization - pretty similar to those in the pro reviews.
con <- glassDoor %>%
select(cons, organization) %>%
mutate(cons = tolower(cons),
cons = lemmatize_strings(cons))
con2 <- con %>%
unnest_tokens(word, cons) %>%
count(organization, word, sort = TRUE) %>%
anti_join(stop_words, by = "word")
conTF <- con %>%
split(., .$organization) %>%
lapply(., function(x) {
consTokens = tm::MC_tokenizer(x$cons)
total = length(consTokens)
})
tempcon <- unlist(conTF)
conTF <- data.frame(organization = names(tempcon),
total = tempcon)
rownames(conTF) <- NULL
conTF <- con2 %>%
left_join(., conTF, by = "organization")
conTF$tfreq <- conTF$n/conTF$total
conTFtop <- conTF%>%
group_by(organization) %>%
arrange(desc(tfreq)) %>%
slice(1:15)
rmarkdown::paged_table(conTFtop)
There are plenty NA’s in the column advice. Establishing the term frequency table without putting these NA’s aside would result in NA being of the highest term frequency for all four organizations. Thus, get rid of those NA’s here.
The table shows the top 15 words with highest term frequency in the advice reviews for each organization. The words are practically the same as pro and con reviews for each company; to iterate, they are “company”, “employee”, “management”, and “people”.
advicedf <- glassDoor %>%
select(advice, organization) %>%
filter(!is.na(advice)) %>%
mutate(advice = tolower(advice),
advice = lemmatize_strings(advice))
advicedf2 <- advicedf %>%
unnest_tokens(word, advice) %>%
count(organization, word, sort = TRUE) %>%
anti_join(stop_words, by = "word")
adviceTF <- advicedf %>%
split(., .$organization) %>%
lapply(., function(x) {
adviceTokens = tm::MC_tokenizer(x$advice)
total = length(adviceTokens)
})
tempadvice <- unlist(adviceTF)
adviceTF <- data.frame(organization = names(tempadvice),
total = tempadvice)
rownames(adviceTF) <- NULL
adviceTF <- advicedf2 %>%
left_join(., adviceTF, by = "organization")
adviceTF$tfreq <- adviceTF$n/adviceTF$total
adviceTFtop <- adviceTF %>%
group_by(organization) %>%
arrange(desc(tfreq)) %>%
slice(1:15)
rmarkdown::paged_table(adviceTFtop)
In addition to term frequency, tf-idf is another valued concept to look at because it implies the words with greater importance as to the document.
This table shows the top 10 most “important” words in the pro reviews for each organization.
#proIDF <- proTF %>%
# group_by(word) %>%
# count() %>%
# mutate(idf = log(length(unique(proTF$organization))/n)) %>%
# arrange(desc(idf))
# rmarkdown::paged_table(proIDF)
proTFIDF <- proTF %>%
tidytext::bind_tf_idf(word, organization, n) %>%
arrange(desc(tf_idf)) %>%
select(-tfreq)
# overall
# proTFIDF
proTFIDFtop <- proTFIDF %>%
group_by(organization) %>%
arrange(desc(tf_idf)) %>%
slice(1:10)
rmarkdown::paged_table(proTFIDFtop)
This table shows the top 10 most “important” words in the con reviews for each organization. Interesting to see words like “bill” and “budget” pop up.
conTFIDF <- conTF %>%
tidytext::bind_tf_idf(word, organization, n) %>%
arrange(desc(tf_idf)) %>%
select(-tfreq)
conTFIDFtop <- conTFIDF %>%
group_by(organization) %>%
arrange(desc(tf_idf)) %>%
slice(1:10)
rmarkdown::paged_table(conTFIDFtop)
This table shows the top 10 most “important” words in the advice for each organization.
adviceTFIDF <- adviceTF %>%
tidytext::bind_tf_idf(word, organization, n) %>%
arrange(desc(tf_idf)) %>%
select(-tfreq)
adviceTFIDFtop <- adviceTFIDF %>%
group_by(organization) %>%
arrange(desc(tf_idf)) %>%
slice(1:10)
rmarkdown::paged_table(adviceTFIDFtop)
This table shows the bigrams with the highest presence in pro reviews, such as “life balance”, “learn lot”, “opportunity learn”, and “lot opportunity”.
proBigram <- glassDoor %>%
select(organization, pros) %>%
mutate(pros = tolower(pros)) %>%
mutate(pros = lemmatize_strings(pros)) %>%
unnest_tokens(bigram, pros, token = "ngrams", n = 2)
proBigram2 <- proBigram %>%
separate(bigram, c("word1", "word2"), sep = " ")
proBigram3 <- proBigram2 %>%
filter(!(word1 %in% stop_words$word)) %>%
filter(!(word2 %in% stop_words$word))
proBigramFinal <- proBigram3 %>%
unite(bigram, word1, word2, sep = " ") %>%
count(bigram, sort = TRUE)
rmarkdown::paged_table(proBigramFinal)
This table shows the bigrams with the highest presence in con reviews. “Life balance”, “senior management” show up again. And “low pay” and “low salary” are quite expected on this list.
conBigram <- glassDoor %>%
select(organization, cons) %>%
mutate(cons = tolower(cons)) %>%
mutate(cons = lemmatize_strings(cons)) %>%
unnest_tokens(bigram, cons, token = "ngrams", n = 2)
conBigram2 <- conBigram %>%
separate(bigram, c("word1", "word2"), sep = " ")
conBigram3 <- conBigram2 %>%
filter(!(word1 %in% stop_words$word)) %>%
filter(!(word2 %in% stop_words$word))
conBigramFinal <- conBigram3 %>%
unite(bigram, word1, word2, sep = " ") %>%
count(bigram, sort = TRUE)
rmarkdown::paged_table(conBigramFinal)
This table shows the bigrams with the highest presence in advice; interesting to see bigrams like “treat employee”, “care employee”, “listen employee”, “pay attention”, and “hr policy”.
adviceBigram <- glassDoor %>%
select(organization, advice) %>%
filter(!is.na(advice)) %>% # get rid of the NA's
mutate(advice = tolower(advice)) %>%
mutate(advice = lemmatize_strings(advice)) %>%
unnest_tokens(bigram, advice, token = "ngrams", n = 2)
adviceBigram2 <- adviceBigram %>%
separate(bigram, c("word1", "word2"), sep = " ")
adviceBigram3 <- adviceBigram2 %>%
filter(!(word1 %in% stop_words$word)) %>%
filter(!(word2 %in% stop_words$word))
adviceBigramFinal <- adviceBigram3 %>%
unite(bigram, word1, word2, sep = " ") %>%
count(bigram, sort = TRUE)
rmarkdown::paged_table(adviceBigramFinal)
To get a basic idea the average of all ratings (overall rating, work life rating, culture value rating, career opporunity rating, company benefit rating, and management rating) by organization/company, the following plot is generated. This step is conducted to see if any type of rating stands out, and is worthy of further inspection of the relationship between THE rating and sentiment/topic modeling.
From the result, we could see that the averages of all ratings are look alike to each other for each organization. Thus, the OVERALL RATING is picked as the factor to be of further inspection as this project proceeds.
avg_rating <- glassDoor %>%
group_by(organization) %>%
summarize(rating_avg = round(mean(rating, na.rm = TRUE), 2),
workLifeRating_avg = round(mean(workLifeRating, na.rm = TRUE), 2),
cultureValueRating_avg = round(mean(cultureValueRating, na.rm = TRUE), 2),
careerOpportunityRating_avg = round(mean(careerOpportunityRating, na.rm = TRUE), 2),
compBenefitsRating_avg = round(mean(compBenefitsRating, na.rm = TRUE), 2),
managementRating_avg = round(mean(managementRating, na.rm = TRUE), 2))
set.seed(1234)
avg_rating_temp <- avg_rating %>% remove_rownames %>% column_to_rownames(var = "organization") %>%
as.data.frame()
avg_rating_temp <- rbind(rep(5,5), rep(0,5), avg_rating_temp)
radarchart(avg_rating_temp, axistype = 1, cglcols = "grey", cglty=1, axislabcol="grey", caxislabels=seq(0,20,5), cglwd=0.8, vlcex = 0.7)
“nrc” lexicon is utilized here to conduct the sentiment analysis.
Preparation work
#get_sentiments("nrc")
nrcWord <- textdata::lexicon_nrc()
nrcValues <- lexicon::hash_sentiment_nrc
# average sentiment per id in cons
conSenti <- glassDoor %>%
select(cons, rating, organization, id) %>%
unnest_tokens(tbl = ., output = word, input = cons) %>%
inner_join(nrcValues, by = c('word' = 'x')) %>%
group_by(id) %>%
summarise(con_sentiment = round(mean(y),2))
#Average sentiment per id in pros
proSenti <- glassDoor %>%
select(pros, rating, organization, id) %>%
unnest_tokens(tbl = ., output = word, input = pros) %>%
inner_join(nrcValues, by = c('word' = 'x')) %>%
group_by(id) %>%
summarise(pro_sentiment = round(mean(y),2))
# Average sentiment per id in advice
adviceSenti <- glassDoor %>%
select(advice, rating, organization, id) %>%
unnest_tokens(tbl = ., output = word, input = advice) %>%
inner_join(nrcValues, by = c('word' = 'x')) %>%
group_by(id) %>%
summarise(advice_sentiment = round(mean(y),2))
Merge the three tables together.
glassDoor2 <- glassDoor %>%
left_join(conSenti, by = "id") %>%
left_join(proSenti, by = "id") %>%
left_join(adviceSenti, by = "id")
Get the table of the average sentiment per organization by pros, cons, and advice
From the table, we could see that just as expected the average sentiment scores assigned to the con reviews are much lower than those assigned to pro reviews and advice for every organization. The sentiment scores reflected from advice are slightly lower than the scores for pro reviews for every organization - which makes sense since the tone used to write advice tends to be more neutral from common sense.
Specifically, ORGC has the lowest sentiment scores (towards negativity) in both con and pro reivews. The sentiment for advice is most negative/lowest for ORGB, but it is one of the two organizations with the highest sentiment scores in pro reviews.
avgSenti <- glassDoor2 %>%
group_by(organization) %>%
summarize (con_sentiment_avg = round(mean(con_sentiment, na.rm = TRUE), 2),
pro_sentiment_avg = round(mean(pro_sentiment, na.rm = TRUE), 2),
advice_sentiment_avg = round(mean(advice_sentiment, na.rm = TRUE), 2))
rmarkdown::paged_table(avgSenti)
# merge the table showing the average ratings by organization and this table showing the average sentiment scores of pros, cons, and advice by organization
glassDoor_org <- avgSenti %>%
left_join(avg_rating, by = "organization") %>%
mutate(organization = as.factor(organization))
To get a basic idea the average of all ratings (overall rating, work life rating, culture value rating, career opporunity rating, company benefit rating, and management rating) by organization/company, the following plot is generated. This step is conducted to see if any type of rating stands out, and is worthy of further inspection of the relationship between THE rating and sentiment/topic modeling.
From the result, we could see that the averages of all ratings are look alike to each other for each organization. Thus, the OVERALL RATING is picked as the factor to be of further inspection as this project proceeds.
avg_rating <- glassDoor %>%
group_by(organization) %>%
summarize(rating_avg = round(mean(rating, na.rm = TRUE), 2),
workLifeRating_avg = round(mean(workLifeRating, na.rm = TRUE), 2),
cultureValueRating_avg = round(mean(cultureValueRating, na.rm = TRUE), 2),
careerOpportunityRating_avg = round(mean(careerOpportunityRating, na.rm = TRUE), 2),
compBenefitsRating_avg = round(mean(compBenefitsRating, na.rm = TRUE), 2),
managementRating_avg = round(mean(managementRating, na.rm = TRUE), 2))
set.seed(1234)
avg_rating_temp <- avg_rating %>% remove_rownames %>% column_to_rownames(var = "organization") %>%
as.data.frame()
avg_rating_temp <- rbind(rep(5,5), rep(0,5), avg_rating_temp)
radarchart(avg_rating_temp, axistype = 1, cglcols = "grey", cglty=1, axislabcol="grey", caxislabels=seq(0,20,5), cglwd=0.8, vlcex = 0.7)
From the output of this mode, we could see that pro sentiment and advice sentiment have significant impact on the rating; while con sentiment also impacts rating significantly but probably not on par with the effect from pro and advice.
glassDoor_mod <- glassDoor %>%
left_join(conSenti, by = "id") %>%
left_join(proSenti, by = "id") %>%
left_join(adviceSenti, by = "id")
mod1 <- lm(rating ~ pro_sentiment + con_sentiment+ advice_sentiment + organization, data = glassDoor_mod)
summary(mod1)
Call:
lm(formula = rating ~ pro_sentiment + con_sentiment + advice_sentiment +
organization, data = glassDoor_mod)
Residuals:
Min 1Q Median 3Q Max
-3.6024 -1.2921 -0.0133 1.2032 2.5402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.02889 0.15999 25.182 < 2e-16 ***
pro_sentiment -0.77921 0.13914 -5.600 3.17e-08 ***
con_sentiment -0.28687 0.09512 -3.016 0.00266 **
advice_sentiment -0.43610 0.09986 -4.367 1.47e-05 ***
organizationORGB -0.03360 0.14973 -0.224 0.82254
organizationORGC -0.08988 0.15061 -0.597 0.55086
organizationORGD -0.17007 0.15124 -1.124 0.26122
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.379 on 645 degrees of freedom
(1025 observations deleted due to missingness)
Multiple R-squared: 0.1053, Adjusted R-squared: 0.09695
F-statistic: 12.65 on 6 and 645 DF, p-value: 1.611e-13
To test the if the distribution of the dependent variable is normal.
Well… it might seem a bit far-fetched to do the mixed model on this issue as the normality of the distribution of rating appears as follows.
ggdensity(glassDoor_mod$rating, fill = "lightgrey")
ggqqplot(glassDoor_mod$rating)
A mixed model approach.
From the estimate of the variance explained by the random effect - it is indistinguishable from zero, so random effect probably doesn’t matter and a regular linear model will serve the purpose here instead. From the estimate of the fixed effect, we can see that the rating does differ given different pro, con, and advice sentiment. In other words, sentiment from pro, con, and advice all impact the rating.
mod2 <- lmer(rating ~ pro_sentiment + con_sentiment + advice_sentiment + (1|organization), data = glassDoor_mod, REML = FALSE)
summary(mod2)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula:
rating ~ pro_sentiment + con_sentiment + advice_sentiment + (1 |
organization)
Data: glassDoor_mod
AIC BIC logLik deviance df.resid
2276.1 2302.9 -1132.0 2264.1 646
Scaled residuals:
Min 1Q Median 3Q Max
-2.59850 -0.92862 -0.02767 0.85930 1.85368
Random effects:
Groups Name Variance Std.Dev.
organization (Intercept) 0.000 0.000
Residual 1.886 1.373
Number of obs: 652, groups: organization, 4
Fixed effects:
Estimate Std. Error t value
(Intercept) 3.96169 0.13457 29.439
pro_sentiment -0.78227 0.13849 -5.649
con_sentiment -0.29193 0.09432 -3.095
advice_sentiment -0.43335 0.09940 -4.360
Correlation of Fixed Effects:
(Intr) pr_snt cn_snt
pro_sentmnt -0.718
con_sentmnt -0.171 -0.111
advc_sntmnt -0.395 -0.132 -0.049
convergence code: 0
boundary (singular) fit: see ?isSingular
Preparation work
tmA <- glassDoor %>%
filter(organization == "ORGA") %>%
summarise(organization = "ORGA",
pros = paste(pros, collapse = " "),
cons = paste(cons, collapse = " "),
advice = paste(advice, collapse = " "),
rating = mean(rating, na.rm = TRUE))
tmB <- glassDoor %>%
filter(organization == "ORGB") %>%
summarise(organization = "ORGB",
pros = paste(pros, collapse = " "),
cons = paste(cons, collapse = " "),
advice = paste(advice, collapse = " "),
rating = mean(rating, na.rm = TRUE))
tmC <- glassDoor %>%
filter(organization == "ORGC") %>%
summarise(organization = "ORGC",
pros = paste(pros, collapse = " "),
cons = paste(cons, collapse = " "),
advice = paste(advice, collapse = " "),
rating = mean(rating, na.rm = TRUE))
tmD <- glassDoor %>%
filter(organization == "ORGD") %>%
summarise(organization = "ORGD",
pros = paste(pros, collapse = " "),
cons = paste(cons, collapse = " "),
advice = paste(advice, collapse = " "),
rating = mean(rating, na.rm = TRUE))
glassDoor_tm <- rbind(tmA, tmB, tmC, tmD)
Preparation work
pro_tm <- glassDoor_tm %>% as.data.frame()
colnames(pro_tm) <- c("doc_id", "text", "rating")
proCorpus = Corpus(DataframeSource(pro_tm))
proTDM <- TermDocumentMatrix(proCorpus, control = list(weighting =
function(x)
weightTfIdf(x, normalize = FALSE)))
inspect(proTDM)
<<TermDocumentMatrix (terms: 3502, documents: 4)>>
Non-/sparse entries: 4221/9787
Sparsity : 70%
Maximal term length: 25
Weighting : term frequency - inverse document frequency (tf-idf)
Sample :
Docs
Terms ORGA ORGB ORGC ORGD
delivery 0.000000 10.000000 0 0.000000
ensure 0.000000 0.000000 10 0.000000
find 6.225562 2.905262 0 4.150375
fmc 8.000000 0.000000 0 2.000000
orga 132.000000 0.000000 0 0.000000
orgb 0.000000 112.000000 0 0.000000
orgc 0.000000 0.000000 144 0.000000
orgcpany 0.000000 0.000000 10 0.000000
orgd 0.000000 0.000000 0 180.000000
something 4.000000 8.000000 0 0.000000
proConvert <- as.data.frame(as.matrix(proTDM))
proTibble = as_tibble(proConvert, .name_repair = "universal")
proTibble <- proTibble %>%
mutate_all(., funs(. + .1))
rownames(proTibble) <- rownames(proConvert)
proNMF <- nmf(proTibble, 4, seed = 1001)
save(proNMF, file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut.Rata")
load("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut.Rata")
wMatrix = as.data.frame(basis(proNMF))
head(wMatrix[order(-wMatrix$V1), ], 5)
V1 V2 V3 V4
orgb 15438.749 2.220446e-16 2.220446e-16 3.908536e-06
delivery 1390.977 1.413403e+01 1.098141e+01 1.611776e+01
enviornment 1115.531 1.441116e+01 1.119673e+01 1.643379e+01
satisfaction 1115.531 1.441116e+01 1.119673e+01 1.643379e+01
something 1115.114 1.394087e+01 4.943255e+02 1.589749e+01
head(wMatrix[order(-wMatrix$V2), ], 5)
V1 V2 V3 V4
orgc 3.668671e-05 22413.2119 2.220446e-16 2.220446e-16
ensure 1.279024e+01 1570.9150 1.122066e+01 1.646891e+01
orgcpany 1.279024e+01 1570.9150 1.122066e+01 1.646891e+01
orgcployees 1.298114e+01 1259.8359 1.138813e+01 1.671471e+01
attention 1.317204e+01 948.7569 1.155560e+01 1.696052e+01
head(wMatrix[order(-wMatrix$V3), ], 5)
V1 V2 V3 V4
orga 2.220446e-16 1.743540e-09 15955.3089 2.220446e-16
fmc 1.275900e+01 1.440669e+01 978.2945 3.714140e+02
find 4.129018e+02 1.402733e+01 763.6384 7.526571e+02
air 1.311997e+01 1.481427e+01 736.7512 1.689348e+01
ardent 1.311997e+01 1.481427e+01 736.7512 1.689348e+01
head(wMatrix[order(-wMatrix$V4), ], 5)
V1 V2 V3 V4
orgd 2.220453e-16 2.220816e-16 10.15161 31948.676
sharona 1.313385e+01 1.482995e+01 11.97329 1436.852
next 1.302973e+01 1.471238e+01 132.75550 1436.718
ceos 1.328657e+01 1.500239e+01 11.99447 1082.064
critical 1.328657e+01 1.500239e+01 11.99447 1082.064
set.seed(1001)
holdoutRows <- sample(1:nrow(glassDoor_mod), 100, replace = FALSE)
pro_text <- textProcessor(documents = glassDoor_mod$pros[-c(holdoutRows)],
metadata = glassDoor_mod[-c(holdoutRows), ],
stem = FALSE)
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Removing numbers...
Creating Output...
pro_prep = prepDocuments(documents=pro_text$documents,
vocab = pro_text$vocab,
meta = pro_text$meta)
Removing 1787 of 3321 terms (1787 of 20611 tokens) due to frequency
Removing 1 Documents with No Words
Your corpus now has 1576 documents, 1534 terms and 18824 tokens.
To determine the best number of topics, focus on the semantic coherence and the residuals. Semantic coherence means how well the words hang together – computed from taking a conditional probability score from frequent words; it is essentially a measure of human interpretability. Thus, low residual and high semantic coherence is preferred. In this case, as k passes 5, the residuals seemingly take a sharp dive as k increases. Thus, choose 5 as the k.
ktest = searchK(documents = pro_prep$documents,
vocab = pro_prep$vocab,
K = c(3, 4, 5, 10, 20), verbose = FALSE)
plot(ktest)
Wih the 5 topics in hand, compute the actual model. The expected topic proportion plot is as follows. Here, topic 4 has the highest expected topic proportion, which kind of makes sense in real life context: as this is the topic modeling on the pro reviews, it is reasonable for words like great, work, company to be included in the highest mentioned topic.
topics5 <- stm(documents = pro_prep$documents,
vocab = pro_prep$vocab, seed = 1001,
K = 5, verbose = FALSE)
plot(topics5)
Now, see what emerges from the topics in details. FREX words are probably what we should focus on here, since they occur frequently within the topic and are exclusive to that topic. Here, in topic 4: FEEX words are great, company, learn, environment, get, life, salary - which are all rational for pro reviews. Also, the highest probability words are worthy of attention because there are the words with the highest probability of occuring within that topic. Here, under topic 4, highest probability words are: work, great, company, learn, environment, get, project - which are in high accordance with the FEEX words.
labelTopics(topics5)
Topic 1 Top Words:
Highest Prob: good, people, management, benefit, nice, job, experience
FREX: good, people, benefit, nice, job, exposure, fun
Lift: amcf, busy, cant, guide, immediate, induction, london
Score: good, people, benefit, nice, management, job, exposure
Topic 2 Top Words:
Highest Prob: team, can, firm, office, provide, year, consultant
FREX: office, provide, consultant, orga, also, need, family
Lift: apart, background, content, dress, ensure, external, fairly
Score: office, firm, team, consultant, need, professional, provide
Topic 3 Top Words:
Highest Prob: employee, lot, pay, culture, friendly, place, balance
FREX: employee, pay, culture, balance, consult, orgd, small
Lift: advisor, analytics, associate, atmosphäre, base, class, clientele
Score: employee, pay, orgd, culture, lot, place, consult
Topic 4 Top Words:
Highest Prob: work, great, company, learn, environment, get, project
FREX: great, company, learn, environment, get, life, salary
Lift: abwechslungsreiche, appear, arbeiten, assignment, auch, aufgaben, beratungsunternehmen
Score: work, great, company, learn, environment, get, life
Topic 5 Top Words:
Highest Prob: opportunity, client, time, make, flexible, help, train
FREX: client, time, make, train, support, orgc, really
Lift: advantage, already, ample, analyst, approachable, bankrupt, certification
Score: opportunity, client, orgc, make, time, give, train
Now, see what reviews that have higher probabilities of being associated with each topic.
findThoughts(topics5, texts = pro_prep$meta$pros, n = 1)
Topic 1:
Quiet little stressful Very good family Very good environment Nice ORGC town Good economy Good sport program
Topic 2:
When I first interview ORGD STEP STRATEGY see Glassdoor review I wrongfully doubt value summer internship program However spend summer ORGD STEP STRATEGY I thank Sharona team enough As University Virginia student I come program experience Series B start management consult experience To surprise I place beautiful open office overlook mountain range Brentwood one affluent neighborhood Los Angeles California On one first day I take site credit retailer million annual revenue work principal system analysis account merchandise floor plan customer relation system The office run quite like start company Each intern come elite school country bring unique perspective network program I privilege pitch idea principal firm high profile connection top CEOs Los Angeles Israel When idea determine valuable I give freedom work project throughout summer culminate event August meet C Suite management cut edge virtual reality augment reality gaming entertainment ad tech venture capital angel invest production cybersecurity technology firm Whether high growth start Silicon Beach just close M fund round billion dollar establish enterprise I see first hand Sharona command respect admiration power player business We even get meet Yossi Vardi Godfather Israel s hello tech industry now start ecosystem world billion invest first half next Silicon Valley Yes Sharona high expectation work deliver expectation help intern develop professional I believe previous Glassdoor reviewer just may ability take constructive criticism use inspire good work habit Each day access kitchen fill snack drink In fact either go free lunch order free lunch office discuss recent be A activity business news meal Bottom line I appreciate Sharona develop professionally turn amateur management consultant analyst contact throughout business world experience multi million dollar client I even recommend program good know firm ability work intimately experience Pw C principal give experience access I just another intern bottom totem pole big firm
Topic 3:
Volle Integration die Projektteams interessante Projekte auf Topmanagement Level sympathische und smarte Kollegen gute Work Life Balance für consult
Topic 4:
Kulturschnelles Wachstum Work Life Balance Ausgleich zwischen Beruf und Freizeit Umgang mit dem Senior Management des Kunden
Topic 5:
Never see company share knowledge freely everyone even competitor Civil engineer learn electrical plan design electrical engineer learn civil structure anything possible wish They make complete engineer Non metro grade A city benefit flexi time ease commute
Next step, since topic models are probablistic, compute he probabilities of each document belonging to a topic.
head(topics5$theta, 15)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.22539466 0.12575379 0.1370302 0.3161109 0.1957105
[2,] 0.20783320 0.11708519 0.1620109 0.3327056 0.1803651
[3,] 0.21384813 0.15223967 0.1689604 0.2796603 0.1852915
[4,] 0.16964886 0.17469019 0.1524817 0.2305308 0.2726484
[5,] 0.22389613 0.11284065 0.1546371 0.3417577 0.1668684
[6,] 0.34155517 0.08164882 0.1242106 0.3525590 0.1000265
[7,] 0.23563987 0.17120546 0.1657124 0.2761865 0.1512558
[8,] 0.10443301 0.30396297 0.2232076 0.1628533 0.2055431
[9,] 0.13258126 0.31363639 0.2109955 0.1873766 0.1554103
[10,] 0.09838425 0.17840294 0.1401331 0.2625235 0.3205563
[11,] 0.15684586 0.27350053 0.1615222 0.2436263 0.1645051
[12,] 0.26129997 0.14524674 0.1377146 0.3308216 0.1249171
[13,] 0.12951030 0.17127237 0.2118687 0.3298087 0.1575399
[14,] 0.11547104 0.30081425 0.2579578 0.1702397 0.1555172
[15,] 0.27911686 0.17869586 0.1375674 0.2808503 0.1237696
Since topic models, on the other hand, are great for predicting topic probabilities for unseen document.
newprotext = textProcessor(documents = glassDoor_mod$pros[holdoutRows],
metadata = glassDoor_mod[holdoutRows, ],
stem = FALSE)
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Removing numbers...
Creating Output...
newprocorp = alignCorpus(new = newprotext, old.vocab = topics5$vocab)
Removing 1 Documents with No Words
Your new corpus now has 99 documents, 517 non-zero terms of 1534 total terms in the original set.
187 terms from the new data did not match.
This means the new data contained 33.7% of the old terms
and the old data contained 73.4% of the unique terms in the new data.
You have retained 1247 tokens of the 1444 tokens you started with (86.4%).
newprofitted = fitNewDocuments(model = topics5, documents = newprocorp$documents,
newData = newprocorp$meta, origData = pro_prep$meta)
...................................................................................................
Preparation work
con_tm <- glassDoor_tm %>% as.data.frame()
colnames(con_tm) <- c("doc_id", "text", "rating")
conCorpus = Corpus(DataframeSource(con_tm))
conTDM <- TermDocumentMatrix(conCorpus, control = list(weighting =
function(x)
weightTfIdf(x, normalize = FALSE)))
inspect(conTDM)
<<TermDocumentMatrix (terms: 3502, documents: 4)>>
Non-/sparse entries: 4221/9787
Sparsity : 70%
Maximal term length: 25
Weighting : term frequency - inverse document frequency (tf-idf)
Sample :
Docs
Terms ORGA ORGB ORGC ORGD
delivery 0.000000 10.000000 0 0.000000
ensure 0.000000 0.000000 10 0.000000
find 6.225562 2.905262 0 4.150375
fmc 8.000000 0.000000 0 2.000000
orga 132.000000 0.000000 0 0.000000
orgb 0.000000 112.000000 0 0.000000
orgc 0.000000 0.000000 144 0.000000
orgcpany 0.000000 0.000000 10 0.000000
orgd 0.000000 0.000000 0 180.000000
something 4.000000 8.000000 0 0.000000
conConvert <- as.data.frame(as.matrix(conTDM))
conTibble = as_tibble(conConvert, .name_repair = "universal")
conTibble <- conTibble %>%
mutate_all(., funs(. + .1))
rownames(conTibble) <- rownames(conConvert)
conNMF <- nmf(conTibble, 4, seed = 1001)
save(conNMF, file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut2.Rata")
load("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut2.Rata")
wMatrix2 = as.data.frame(basis(conNMF))
head(wMatrix2[order(-wMatrix2$V1), ], 5)
V1 V2 V3 V4
orgb 15438.749 2.220446e-16 2.220446e-16 3.908536e-06
delivery 1390.977 1.413403e+01 1.098141e+01 1.611776e+01
enviornment 1115.531 1.441116e+01 1.119673e+01 1.643379e+01
satisfaction 1115.531 1.441116e+01 1.119673e+01 1.643379e+01
something 1115.114 1.394087e+01 4.943255e+02 1.589749e+01
head(wMatrix2[order(-wMatrix2$V2), ], 5)
V1 V2 V3 V4
orgc 3.668671e-05 22413.2119 2.220446e-16 2.220446e-16
ensure 1.279024e+01 1570.9150 1.122066e+01 1.646891e+01
orgcpany 1.279024e+01 1570.9150 1.122066e+01 1.646891e+01
orgcployees 1.298114e+01 1259.8359 1.138813e+01 1.671471e+01
attention 1.317204e+01 948.7569 1.155560e+01 1.696052e+01
head(wMatrix2[order(-wMatrix2$V3), ], 5)
V1 V2 V3 V4
orga 2.220446e-16 1.743540e-09 15955.3089 2.220446e-16
fmc 1.275900e+01 1.440669e+01 978.2945 3.714140e+02
find 4.129018e+02 1.402733e+01 763.6384 7.526571e+02
air 1.311997e+01 1.481427e+01 736.7512 1.689348e+01
ardent 1.311997e+01 1.481427e+01 736.7512 1.689348e+01
head(wMatrix2[order(-wMatrix2$V4), ], 5)
V1 V2 V3 V4
orgd 2.220453e-16 2.220816e-16 10.15161 31948.676
sharona 1.313385e+01 1.482995e+01 11.97329 1436.852
next 1.302973e+01 1.471238e+01 132.75550 1436.718
ceos 1.328657e+01 1.500239e+01 11.99447 1082.064
critical 1.328657e+01 1.500239e+01 11.99447 1082.064
set.seed(1001)
holdoutRows <- sample(1:nrow(glassDoor_mod), 100, replace = FALSE)
con_text <- textProcessor(documents = glassDoor_mod$cons[-c(holdoutRows)],
metadata = glassDoor_mod[-c(holdoutRows), ],
stem = FALSE)
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Removing numbers...
Creating Output...
con_prep = prepDocuments(documents=con_text$documents,
vocab = con_text$vocab,
meta = con_text$meta)
Removing 2158 of 4033 terms (2158 of 21548 tokens) due to frequency
Removing 3 Documents with No Words
Your corpus now has 1573 documents, 1875 terms and 19390 tokens.
In this case, as k passes 5, the residuals seemingly take a sharp dive as k increases. Thus, choose 5 as the k - yet again.
ktest2 = searchK(documents = con_prep$documents,
vocab = con_prep$vocab,
K = c(3, 4, 5, 10, 20), verbose = FALSE)
plot(ktest2)
Wih the 5 topics in hand, compute the actual model. The expected topic proportion plot is as follows. Here, topic 5 has the highest expected topic proportion, which contains words like company, good and much. Here “good” appears to be a bit dubious. The topic with the second highest expected proportion is topic 1.
topics5_2 <- stm(documents = con_prep$documents,
vocab = con_prep$vocab, seed = 1001,
K = 5, verbose = FALSE)
plot(topics5_2)
Now, see what emerges from the topics in details. FREX words are probably what we should focus on here, since they occur frequently within the topic and are exclusive to that topic. Here, in topic 5: FEEX words are company, good, lack, little, need, low, business - with the presence of words like “lack” and “need”, this topic being expected of high proportion is more rational now. Also, the highest probability words are worthy of attention because there are the words with the highest probability of occuring within that topic. Again, under topic 4, highest probability words are: company, good, much, lack, little, need, low. Here, the addition of “low” just makes more sense of topic 5’s pertaining to con reviews.
labelTopics(topics5_2)
Topic 1 Top Words:
Highest Prob: work, project, life, lot, long, balance, hard
FREX: project, life, lot, balance, hard, sometimes, compensation
Lift: acquisition, adapt, admin, affect, agree, allocation, als
Score: work, balance, life, und, die, project, sometimes
Topic 2 Top Words:
Highest Prob: can, people, hour, staff, firm, consult, year
FREX: people, staff, firm, small, new, orga, contract
Lift: acquire, active, always, apart, behind, bottom, bureaucratic
Score: can, staff, firm, people, year, support, none
Topic 3 Top Words:
Highest Prob: management, employee, pay, don, make, place, take
FREX: pay, take, even, leave, lead, week, never
Lift: act, advertise, analysis, begin, bond, bus, butt
Score: pay, employee, management, even, leave, take, don
Topic 4 Top Words:
Highest Prob: get, time, client, office, high, like, manager
FREX: time, client, office, high, like, opportunity, really
Lift: anywhere, avoid, base, box, cause, chase, club
Score: get, time, client, know, like, high, travel
Topic 5 Top Words:
Highest Prob: company, good, much, lack, little, need, low
FREX: company, good, lack, little, need, low, business
Lift: alabama, capacity, child, clean, competitor, conservative, coverage
Score: company, lack, good, team, con, little, business
Now, see what reviews that have higher probabilities of being associated with each topic.
findThoughts(topics5_2, texts = con_prep$meta$cons, n = 1)
Topic 1:
Schwerfällige und langsame Prozesse Projekte ohne wirkliche Auswirkungen Initiativen und Programme die nach Weiterleitung auf Geschäftsebenen verwässern Erfolglose Umstrukturierungsversuche
Topic 2:
Salary always play catch Mc Kinsey s Bain s BCG s result always seem bite behind The performance review process seem bite vague effort always link reward bonus promotion promotion affect good First level bonus almost unheard drop precipitously mark Work can dynamic interest gaggle mundane project good The project one staff luck draw room system advocate interest project exist
Topic 3:
The list endless appoint team manager autocratic approach Lunch break You force take half n hour lunch break team wish break timing also give reason literally say yes everything wish sustain organisation Else humiliate ask give resignation Also put false allegation resignation reason Unfair treatment due diligence give employee s feedback Half day salary Office timing start am can report late half day salary deduct Which also act de motivation s good come half day instead get half day salary deduct minute You allow take early leave leave actually require even ready submit medical document s another funny thing though have keep tea coffee vend machine machine operation just twice day have specifically keep fix hour mean can t even drink tea coffee per convenience do per wish No proper train give even thefresher candidate good place growth high attrition rate also mention reason
Topic 4:
frustrate people receive lead generator actually make contact account make introductory call next day send sheet time owner contact expect state time date state form In case inaccurate time So egg come chicken opportunity sale decrease near zero Also case I zero train manager assistant area manager something I promise occur I take position
Topic 5:
liar lack communication dishon ORGDst l ORGDad ORGDrship th ORGD gov ORGDrnm ORGDnt sit ORGD can t manag ORGD th ORGD ORGDmploy ORGDORGDs Mak ORGD ORGDxcus ORGDs l ORGDtting som ORGDon ORGD go win t list ORGDn th ORGD ORGDmploy ORGDORGD Would r ORGDcomm ORGDnd anyon ORGD job work company
Next step, since topic models are probablistic, compute he probabilities of each document belonging to a topic.
head(topics5_2$theta, 15)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.22869541 0.1729554 0.13122362 0.13299962 0.33412595
[2,] 0.39414666 0.2361382 0.08578816 0.11323388 0.17069307
[3,] 0.11002442 0.1577489 0.28969997 0.26877928 0.17374742
[4,] 0.10308559 0.1519916 0.28320292 0.30294445 0.15877542
[5,] 0.32811454 0.2071410 0.10056000 0.13780875 0.22637573
[6,] 0.37267083 0.1215989 0.25455147 0.15322175 0.09795707
[7,] 0.27169774 0.2030026 0.13030132 0.17937442 0.21562398
[8,] 0.13755266 0.1444468 0.20128626 0.41771536 0.09899892
[9,] 0.11144815 0.1944214 0.17141528 0.15262701 0.37008812
[10,] 0.13810604 0.2436554 0.16594474 0.19072494 0.26156883
[11,] 0.13909243 0.2399167 0.23121935 0.18332491 0.20644664
[12,] 0.08990613 0.1491018 0.14312706 0.11859870 0.49926632
[13,] 0.08990613 0.1491018 0.14312706 0.11859870 0.49926632
[14,] 0.67112038 0.1103481 0.05497527 0.07829753 0.08525876
[15,] 0.38029277 0.1744278 0.14968240 0.16452531 0.13107167
Preparation work
advice_tm <- glassDoor_tm %>% as.data.frame()
colnames(advice_tm) <- c("doc_id", "text", "rating")
adviceCorpus = Corpus(DataframeSource(advice_tm))
adviceTDM <- TermDocumentMatrix(adviceCorpus, control = list(weighting =
function(x)
weightTfIdf(x, normalize = FALSE)))
inspect(adviceTDM)
<<TermDocumentMatrix (terms: 3502, documents: 4)>>
Non-/sparse entries: 4221/9787
Sparsity : 70%
Maximal term length: 25
Weighting : term frequency - inverse document frequency (tf-idf)
Sample :
Docs
Terms ORGA ORGB ORGC ORGD
delivery 0.000000 10.000000 0 0.000000
ensure 0.000000 0.000000 10 0.000000
find 6.225562 2.905262 0 4.150375
fmc 8.000000 0.000000 0 2.000000
orga 132.000000 0.000000 0 0.000000
orgb 0.000000 112.000000 0 0.000000
orgc 0.000000 0.000000 144 0.000000
orgcpany 0.000000 0.000000 10 0.000000
orgd 0.000000 0.000000 0 180.000000
something 4.000000 8.000000 0 0.000000
adviceConvert <- as.data.frame(as.matrix(adviceTDM))
adviceTibble = as_tibble(adviceConvert, .name_repair = "universal")
adviceTibble <- adviceTibble %>%
mutate_all(., funs(. + .1))
rownames(adviceTibble) <- rownames(adviceConvert)
adviceNMF <- nmf(adviceTibble, 4, seed = 1001)
save(adviceNMF, file = "/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut3.Rata")
load("/Users/katliyx/Documents/SPRING2020/ITAO70250-Unstructured/Homework/HW2/nmfOut3.Rata")
wMatrix3 = as.data.frame(basis(adviceNMF))
head(wMatrix3[order(-wMatrix3$V1), ], 5)
V1 V2 V3 V4
orgb 15438.749 2.220446e-16 2.220446e-16 3.908536e-06
delivery 1390.977 1.413403e+01 1.098141e+01 1.611776e+01
enviornment 1115.531 1.441116e+01 1.119673e+01 1.643379e+01
satisfaction 1115.531 1.441116e+01 1.119673e+01 1.643379e+01
something 1115.114 1.394087e+01 4.943255e+02 1.589749e+01
head(wMatrix3[order(-wMatrix3$V2), ], 5)
V1 V2 V3 V4
orgc 3.668671e-05 22413.2119 2.220446e-16 2.220446e-16
ensure 1.279024e+01 1570.9150 1.122066e+01 1.646891e+01
orgcpany 1.279024e+01 1570.9150 1.122066e+01 1.646891e+01
orgcployees 1.298114e+01 1259.8359 1.138813e+01 1.671471e+01
attention 1.317204e+01 948.7569 1.155560e+01 1.696052e+01
head(wMatrix3[order(-wMatrix3$V3), ], 5)
V1 V2 V3 V4
orga 2.220446e-16 1.743540e-09 15955.3089 2.220446e-16
fmc 1.275900e+01 1.440669e+01 978.2945 3.714140e+02
find 4.129018e+02 1.402733e+01 763.6384 7.526571e+02
air 1.311997e+01 1.481427e+01 736.7512 1.689348e+01
ardent 1.311997e+01 1.481427e+01 736.7512 1.689348e+01
head(wMatrix3[order(-wMatrix3$V4), ], 5)
V1 V2 V3 V4
orgd 2.220453e-16 2.220816e-16 10.15161 31948.676
sharona 1.313385e+01 1.482995e+01 11.97329 1436.852
next 1.302973e+01 1.471238e+01 132.75550 1436.718
ceos 1.328657e+01 1.500239e+01 11.99447 1082.064
critical 1.328657e+01 1.500239e+01 11.99447 1082.064
set.seed(1001)
holdoutRows <- sample(1:nrow(glassDoor_mod), 100, replace = FALSE)
advice_text <- textProcessor(documents = glassDoor_mod$advice[-c(holdoutRows)],
metadata = glassDoor_mod[-c(holdoutRows), ],
stem = FALSE)
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Removing numbers...
Creating Output...
advice_prep = prepDocuments(documents=advice_text$documents,
vocab = advice_text$vocab,
meta = advice_text$meta)
Removing 1126 of 2036 terms (1126 of 8407 tokens) due to frequency
Removing 15 Documents with No Words
Your corpus now has 934 documents, 910 terms and 7281 tokens.
In this case, as k passes 5, the residuals seemingly take a sharp dive as k increases - that drop is pretty spectacular. Thus, choose 5 as the k - again and again.
ktest3 = searchK(documents = advice_prep$documents,
vocab = advice_prep$vocab,
K = c(3, 4, 5, 10, 20), verbose = FALSE)
plot(ktest3)
For comments in the advice column, topic 3 has the highest expected topic proportion, and topic 5 follows. It is interesting to see words like “train” pop up since implication as “training opportunities” is pertinent to advice from many perspectives.
topics5_3 <- stm(documents = advice_prep$documents,
vocab = advice_prep$vocab, seed = 1001,
K = 5, verbose = FALSE)
plot(topics5_3)
Now, see what emerges from the topics in details. FREX words are probably what we should focus on here, since they occur frequently within the topic and are exclusive to that topic. Here, in topic 3: FEEX words are good, work, company, need, get, take, manager. “Need” really tells something about advice. Also, the highest probability words are worthy of attention because there are the words with the highest probability of occuring within that topic. Again, under topic 3, highest probability words are: employee, good, work, company, need, get, take. Here, high probability words kind of echo with the FREX words.
labelTopics(topics5_3)
Topic 1 Top Words:
Highest Prob: make, continue, client, team, respect, want, increase
FREX: continue, client, team, increase, career, stay, resource
Lift: instead, may, phone, practice, relationship, site, sure
Score: continue, team, make, client, practice, sure, increase
Topic 2 Top Words:
Highest Prob: management, don, time, pay, way, job, try
FREX: management, don, time, pay, way, job, level
Lift: job, advise, almost, also, another, answer, anyone
Score: management, don, time, pay, way, job, long
Topic 3 Top Words:
Highest Prob: employee, good, work, company, need, get, take
FREX: good, work, company, need, get, take, manager
Lift: acquire, adhere, amount, analyst, approach, assume, bright
Score: good, employee, work, company, need, get, just
Topic 4 Top Words:
Highest Prob: keep, staff, hire, great, treat, focus, much
FREX: keep, hire, great, treat, focus, much, place
Lift: action, base, consistent, cut, direction, everyone, focus
Score: keep, staff, hire, great, much, treat, place
Topic 5 Top Words:
Highest Prob: people, business, train, new, can, give, like
FREX: people, business, new, give, like, please, project
Lift: add, area, consultant, deliver, develop, experience, honest
Score: people, new, project, learn, like, give, business
Now, see what reviews that have higher probabilities of being associated with each topic.
findThoughts(topics5_3, texts = advice_prep$meta$advice, n = 1)
Topic 1:
Communicate little The Street want hear instead talk Front Line want hear question
Topic 2:
If I John Burgess owner I start fix tarnish lead database isn t hard If bad lead mention fix raise sale AND I WOULD FIGURE OUT A WAY TO PAY employee FOR THEIR OWN PERFORMANCE AND NOT FOR THE PERFORMANCE OF other
Topic 3:
Look employee listen one job can probably help figure work doesn t Retain employee massive turnover good employee client show unstable company employee afraid job review claim Get organize make plan Randomly change process policy without clear plan employee create chaos even harsh environment already Communicate company s plan employee whether sell company improve plan anything work
Topic 4:
I recommend much keep awesome Christmas party keep keep fun
Topic 5:
Giv ORGD th ORGD mang ORGDrs train b ORGDing b ORGDtt ORGDr communicator train List ORGDn th ORGD ORGDmbarrassing ORGDmploy ORGDORGDs Th ORGDy ar ORGD human have ORGDs ORGDrv ORGDd b ORGD h ORGDard
Next step, since topic models are probablistic, compute he probabilities of each document belonging to a topic.
head(topics5_3$theta, 15)
[,1] [,2] [,3] [,4] [,5]
[1,] 0.14627924 0.20513466 0.2427464 0.15521592 0.2506238
[2,] 0.09978360 0.23643414 0.2586945 0.22856616 0.1765216
[3,] 0.08243238 0.11316495 0.3046714 0.13412719 0.3656040
[4,] 0.09270360 0.13040056 0.1333031 0.06433677 0.5792560
[5,] 0.17558943 0.17532593 0.2312919 0.18935416 0.2284386
[6,] 0.07390013 0.23863909 0.3091568 0.23192541 0.1463786
[7,] 0.16758335 0.11784704 0.2242318 0.20023769 0.2901001
[8,] 0.15818393 0.15875001 0.2419165 0.16449827 0.2766513
[9,] 0.09326340 0.23056989 0.2839878 0.22495201 0.1672269
[10,] 0.07182122 0.50233641 0.2006376 0.11244249 0.1127623
[11,] 0.09503489 0.27850053 0.2773401 0.20889718 0.1402273
[12,] 0.11441160 0.25884813 0.2562035 0.17853134 0.1920055
[13,] 0.19742083 0.16236406 0.1610574 0.16301120 0.3161465
[14,] 0.08381568 0.06424385 0.1036296 0.08203162 0.6662793
[15,] 0.11441160 0.25884813 0.2562035 0.17853134 0.1920055
Preparation work
glassDoor3 <- glassDoor_mod %>%
drop_na(rating)
predictortext <- textProcessor(documents = glassDoor3$pros,
metadata = glassDoor3,
stem = FALSE)
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Removing numbers...
Creating Output...
proprep2 <- prepDocuments(documents = predictortext$documents,
vocab = predictortext$vocab,
meta = predictortext$meta)
Removing 1863 of 3454 terms (1863 of 21932 tokens) due to frequency
Removing 1 Documents with No Words
Your corpus now has 1676 documents, 1591 terms and 20069 tokens.
The Prediction Models
For the topics of pro reviews, we could see that rating has significant impact on topics 1, 5, and on topic 4.
topicpredictor = stm(documents = proprep2$documents,
vocab = proprep2$vocab, prevalence = ~rating,
data = proprep2$meta, K = 5, verbose = FALSE)
proratingeffect = estimateEffect(1:5 ~ rating, stmobj = topicpredictor,
metadata = proprep2$meta)
summary(proratingeffect, topics = c(1:5))
Call:
estimateEffect(formula = 1:5 ~ rating, stmobj = topicpredictor,
metadata = proprep2$meta)
Topic 1:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.118978 0.004797 24.80 <2e-16 ***
rating 0.016395 0.001569 10.45 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 2:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.212271 0.006302 33.683 <2e-16 ***
rating 0.003071 0.002191 1.402 0.161
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 3:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.242782 0.006881 35.282 <2e-16 ***
rating -0.001880 0.002338 -0.804 0.421
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 4:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.199226 0.006519 30.563 <2e-16 ***
rating -0.004040 0.002254 -1.792 0.0733 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 5:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.226795 0.008510 26.649 < 2e-16 ***
rating -0.013545 0.002864 -4.729 2.44e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Plots.
Here, in each plot for each topic, the x-axes denote the ratings while the y-axes denote the corresponding expected topic proportion. As the rating decreases, the contents of pro reviews are mostly likely to fall under topic 3 (notice that there are words like “small”, which might imply “small company” or “small team”, which might work adversely when it comes to rating), then topic 5. As the rating increases, the contents of pro reviews tend to fall under topic 2.
plot.estimateEffect(proratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 1, labeltype = "frex")
plot.estimateEffect(proratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 2, labeltype = "frex")
plot.estimateEffect(proratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 3, labeltype = "frex")
plot.estimateEffect(proratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 4, labeltype = "frex")
plot.estimateEffect(proratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 5, labeltype = "frex")
Preparation work
predictortext2 <- textProcessor(documents = glassDoor3$cons,
metadata = glassDoor3,
stem = FALSE)
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Removing numbers...
Creating Output...
conprep2 <- prepDocuments(documents = predictortext2$documents,
vocab = predictortext2$vocab,
meta = predictortext2$meta)
Removing 2212 of 4146 terms (2212 of 22558 tokens) due to frequency
Removing 2 Documents with No Words
Your corpus now has 1674 documents, 1934 terms and 20346 tokens.
The Prediction Models
For the topics of pro reviews, we could see that rating has significant impact on all topics from 1 to 5.
topicpredictor2 = stm(documents = conprep2$documents,
vocab = conprep2$vocab, prevalence = ~rating,
data = conprep2$meta, K = 5, verbose = FALSE)
conratingeffect = estimateEffect(1:5 ~ rating, stmobj = topicpredictor2,
metadata = conprep2$meta)
summary(conratingeffect, topics = c(1:5))
Call:
estimateEffect(formula = 1:5 ~ rating, stmobj = topicpredictor2,
metadata = conprep2$meta)
Topic 1:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.040494 0.009703 -4.173 3.16e-05 ***
rating 0.074916 0.003421 21.901 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 2:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.23574 0.01020 23.112 < 2e-16 ***
rating -0.01452 0.00335 -4.334 1.55e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 3:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.412832 0.013944 29.61 <2e-16 ***
rating -0.062860 0.004449 -14.13 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 4:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.266247 0.011423 23.308 < 2e-16 ***
rating -0.022380 0.003571 -6.267 4.66e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 5:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.125924 0.010924 11.527 < 2e-16 ***
rating 0.024727 0.003734 6.622 4.76e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Plots.
Here, in each plot for each topic, the x-axes denote the ratings while the y-axes denote the corresponding expected topic proportion. As the rating increases, the con reviews are most likely to fall under topic 1; as the rating decreases, the con reviews are most likely to fall under topic 3.
plot.estimateEffect(conratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 1, labeltype = "frex")
plot.estimateEffect(conratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 2, labeltype = "frex")
plot.estimateEffect(conratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 3, labeltype = "frex")
plot.estimateEffect(conratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 4, labeltype = "frex")
plot.estimateEffect(conratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 5, labeltype = "frex")
Preparation work
predictortext3 <- textProcessor(documents = glassDoor3$advice,
metadata = glassDoor3,
stem = FALSE)
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Removing numbers...
Creating Output...
adviceprep2 <- prepDocuments(documents = predictortext3$documents,
vocab = predictortext3$vocab,
meta = predictortext3$meta)
Removing 1152 of 2101 terms (1152 of 8847 tokens) due to frequency
Removing 15 Documents with No Words
Your corpus now has 989 documents, 949 terms and 7695 tokens.
The Prediction Models
For the topics of advice, we could see that rating has significant impact on topics 2, 3, 4, and 5. Rating also has impact on topic 1, but probably not that significant compared with the rest.
topicpredictor3 = stm(documents = adviceprep2$documents,
vocab = adviceprep2$vocab, prevalence = ~rating,
data = adviceprep2$meta, K = 5, verbose = FALSE)
adviceratingeffect = estimateEffect(1:5 ~ rating, stmobj = topicpredictor3,
metadata = adviceprep2$meta)
summary(adviceratingeffect, topics = c(1:5))
Call:
estimateEffect(formula = 1:5 ~ rating, stmobj = topicpredictor3,
metadata = adviceprep2$meta)
Topic 1:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.158029 0.004212 37.516 <2e-16 ***
rating -0.003035 0.001303 -2.329 0.0201 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 2:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.222876 0.005778 38.574 < 2e-16 ***
rating -0.007847 0.001724 -4.553 5.96e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 3:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.166785 0.005989 27.847 <2e-16 ***
rating 0.006539 0.002130 3.069 0.0022 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 4:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.232235 0.007277 31.916 < 2e-16 ***
rating -0.009205 0.002254 -4.084 4.78e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Topic 5:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.220150 0.005818 37.838 < 2e-16 ***
rating 0.013508 0.001820 7.422 2.48e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Plots.
Here, in each plot for each topic, the x-axes denote the ratings while the y-axes denote the corresponding expected topic proportion. As the rating increases, the advice contents is likely to fall under topic 5. As the rating gets lower, the advice contents is likely to fall under topic 4 and then topic 2, with a small difference in the expected topic proportion.
plot.estimateEffect(adviceratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 1, labeltype = "frex")
plot.estimateEffect(adviceratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 2, labeltype = "frex")
plot.estimateEffect(adviceratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 3, labeltype = "frex")
plot.estimateEffect(adviceratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 4, labeltype = "frex")
plot.estimateEffect(adviceratingeffect, "rating", method = "continuous", model = topicpredictor, topics = 5, labeltype = "frex")
To answer the questions raised earlier in the introduction.
(1) The words with highest term frequencies appear similar in pro reviews, con reviews, and advice for the four organizations. However, the unigrams of higher tf-idf’s differ in pros, cons, and advice for every organization. Bigrams with higher presence differ for pro reviews, con reviews, and advice.
(2) Overall rating is the factor of interest to investigate on for this project.
(3) Sentiment reflected in pros, cons, and advice all impact the overall rating.
(4) Topic modelings for pro reviews, con reviews, and advice appear rational to some extent. For details, please refer back to the Topic Modeling section.
(5) Rating does have certain impact in the mapping of topics for pros, cons, and advice.
Regrettably, the word clouds seem not to be working that well (reasons to be discovered later). Also, I tried to clean up the German shown in the text, but that seems not to be that thorough as some German pops up in topic modeling.
Regrettably, the word clouds seem not to be working that well (reasons to be discovered later). Also, I tried to clean up the German shown in the text, but that seems not to be that thorough as some German pops up in topic modeling.